Retain zero-mutation samples #44

Merged: 4 commits, Apr 16, 2018

dhimmel
Member

@dhimmel dhimmel commented Apr 12, 2018

Refs #43 (comment)

Note that this may not retain every sample that was sequenced but has no mutations. However, it does retain all sequenced samples that we're aware of via Xena.

@dhimmel
Member Author

dhimmel commented Apr 12, 2018

A small code change, as seen in the diff for scripts/2.TCGA-process.py, enables retaining samples that:

  1. have mutation calls (potentially silent mutations)
  2. but do not have red or blue mutations.

This increases the complete mutation matrix to 9104 samples from 9093 and the aligned mutation matrix to 8397 from 8388. So we're gaining 9 samples for cognoma.

You can see specific samples added in the diff for data/samples.tsv. None are ovarian cancer.
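
For illustration, a minimal sketch of how zero-mutation samples can be retained when building a binary matrix. This is not the actual change in scripts/2.TCGA-process.py: the data, the column names (`sample_id`, `gene`, `effect`), and the set of retained effects are all assumptions made up for the example.

```python
import pandas

# Hypothetical mutation calls; column names and effect labels are assumptions.
calls_df = pandas.DataFrame({
    "sample_id": ["S1", "S1", "S2", "S3"],
    "gene": ["TP53", "KRAS", "TP53", "BRAF"],
    "effect": ["Missense_Mutation", "Silent", "Silent", "Nonsense_Mutation"],
})

# Effects counted as mutations in the binary matrix (illustrative subset).
retained_effects = {"Missense_Mutation", "Nonsense_Mutation"}

# Every sample with at least one call, including calls that get filtered out.
sequenced_samples = sorted(calls_df["sample_id"].unique())

filtered_df = calls_df[calls_df["effect"].isin(retained_effects)]
matrix_df = (
    pandas.crosstab(filtered_df["sample_id"], filtered_df["gene"])
    .clip(upper=1)
    # Reindexing restores samples whose calls were all filtered out,
    # giving them an all-zero row instead of dropping them.
    .reindex(sequenced_samples, fill_value=0)
)
```

Here `S2` has only a silent call, so it would vanish after filtering; the reindex step brings it back as a row of zeros.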

@dhimmel
Member Author

dhimmel commented Apr 13, 2018

In 9f9f675 I added a diseases.tsv file under data with summary information for each cancer. It's useful for tracking sample numbers by cancer type for the various datasets.
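
A per-cancer summary of this kind could be sketched roughly as follows. The input table, its column names, and the output columns are hypothetical; the real diseases.tsv may be laid out differently.

```python
import pandas

# Hypothetical sample table; the real data/samples.tsv columns may differ.
samples_df = pandas.DataFrame({
    "sample_id": ["A-01", "B-01", "C-01", "D-01"],
    "acronym": ["BRCA", "BRCA", "SKCM", "OV"],
    "has_mutation_data": [True, True, False, True],
    "has_expression_data": [True, False, True, True],
})

# One row per cancer type with sample counts for each dataset.
diseases_df = (
    samples_df
    .groupby("acronym")
    .agg(
        n_samples=("sample_id", "count"),
        n_mutation=("has_mutation_data", "sum"),
        n_expression=("has_expression_data", "sum"),
    )
    .reset_index()
)

# Writing the summary out would then be a single call, for example:
# diseases_df.to_csv("diseases.tsv", sep="\t", index=False)
```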

Member

@gwaybio gwaybio left a comment

One typo

The updates look good. The resulting binary matrix represents a list of high confidence mutation calls for all samples with matching mutation, gene expression, and clinical data.

I am curious, are there any samples in either mutation or gene expression data that are not in the clinical data? This seems unlikely, but we probably don't want to filter these samples either (we can infer its acronym.)

sample_ids is a pandas.Series
"""
sample_ids = pandas.Series(sample_ids)
aconyms = sample_ids.map(sample_to_acronym)
Member

aconyms --> acronyms

@dhimmel
Member Author

dhimmel commented Apr 16, 2018

are there any samples in either mutation or gene expression data that are not in the clinical data?

I used the following code:

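# Samples in both the mutation and expression matrices but absent from the clinical matrix.
# Index.intersection replaces the `&` operator, which newer pandas no longer treats as a set operation.
shared_index = gene_mutation_mat_df.index.intersection(expr_df.index)
samples_missing_clinical = sorted(shared_index.difference(clinmat_df.index))
for sample_id in samples_missing_clinical:
    print(sample_id)
len(samples_missing_clinical)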

Turns out there are 389 missing samples:

mutation or expression samples missing clinical data
TCGA-28-2510-01
TCGA-2G-AAKO-01
TCGA-2G-AALF-01
TCGA-2G-AALG-01
TCGA-2G-AALN-01
TCGA-2G-AALO-01
TCGA-2G-AALQ-01
TCGA-2G-AALR-01
TCGA-2G-AALS-01
TCGA-2G-AALT-01
TCGA-2G-AALW-01
TCGA-2G-AALX-01
TCGA-2G-AALY-01
TCGA-2G-AALZ-01
TCGA-2G-AAM2-01
TCGA-2G-AAM3-01
TCGA-2G-AAM4-01
TCGA-3N-A9WB-06
TCGA-3N-A9WC-06
TCGA-3N-A9WD-06
TCGA-5M-AAT5-01
TCGA-5M-AATA-01
TCGA-AR-A0U1-01
TCGA-BF-AAP0-06
TCGA-BH-A0HN-01
TCGA-C4-A0EZ-01
TCGA-C4-A0F1-01
TCGA-C4-A0F7-01
TCGA-D3-A1Q1-06
TCGA-D3-A1Q3-06
TCGA-D3-A1Q4-06
TCGA-D3-A1Q5-06
TCGA-D3-A1Q6-06
TCGA-D3-A1Q7-06
TCGA-D3-A1Q8-06
TCGA-D3-A1Q9-06
TCGA-D3-A1QA-06
TCGA-D3-A1QB-06
TCGA-D3-A2J6-06
TCGA-D3-A2J7-06
TCGA-D3-A2J8-06
TCGA-D3-A2J9-06
TCGA-D3-A2JA-06
TCGA-D3-A2JB-06
TCGA-D3-A2JC-06
TCGA-D3-A2JD-06
TCGA-D3-A2JE-06
TCGA-D3-A2JF-06
TCGA-D3-A2JG-06
TCGA-D3-A2JH-06
TCGA-D3-A2JK-06
TCGA-D3-A2JL-06
TCGA-D3-A2JN-06
TCGA-D3-A2JO-06
TCGA-D3-A2JP-06
TCGA-D3-A3BZ-06
TCGA-D3-A3C1-06
TCGA-D3-A3C3-06
TCGA-D3-A3C6-06
TCGA-D3-A3C7-06
TCGA-D3-A3C8-06
TCGA-D3-A3CB-06
TCGA-D3-A3CC-06
TCGA-D3-A3CE-06
TCGA-D3-A3CF-06
TCGA-D3-A3ML-06
TCGA-D3-A3MO-06
TCGA-D3-A3MR-06
TCGA-D3-A3MU-06
TCGA-D3-A3MV-06
TCGA-D3-A51E-06
TCGA-D3-A51F-06
TCGA-D3-A51G-06
TCGA-D3-A51H-06
TCGA-D3-A51J-06
TCGA-D3-A51K-06
TCGA-D3-A51N-06
TCGA-D3-A51R-06
TCGA-D3-A51T-06
TCGA-D3-A5GL-06
TCGA-D3-A5GN-06
TCGA-D3-A5GO-06
TCGA-D3-A5GR-06
TCGA-D3-A5GS-06
TCGA-D3-A5GU-06
TCGA-D3-A8GB-06
TCGA-D3-A8GC-06
TCGA-D3-A8GD-06
TCGA-D3-A8GE-06
TCGA-D3-A8GI-06
TCGA-D3-A8GJ-06
TCGA-D3-A8GK-06
TCGA-D3-A8GL-06
TCGA-D3-A8GM-06
TCGA-D3-A8GN-06
TCGA-D3-A8GO-06
TCGA-D3-A8GP-06
TCGA-D3-A8GQ-06
TCGA-D3-A8GR-06
TCGA-D3-A8GS-06
TCGA-D3-A8GV-06
TCGA-D9-A148-06
TCGA-D9-A149-06
TCGA-D9-A1JW-06
TCGA-D9-A1JX-06
TCGA-D9-A1X3-06
TCGA-D9-A3Z1-06
TCGA-D9-A3Z3-06
TCGA-D9-A4Z6-06
TCGA-D9-A6E9-06
TCGA-D9-A6EA-06
TCGA-D9-A6EC-06
TCGA-D9-A6EG-06
TCGA-DA-A1HV-06
TCGA-DA-A1HW-06
TCGA-DA-A1HY-06
TCGA-DA-A1I0-06
TCGA-DA-A1I1-06
TCGA-DA-A1I2-06
TCGA-DA-A1I4-06
TCGA-DA-A1I5-06
TCGA-DA-A1I7-06
TCGA-DA-A1I8-06
TCGA-DA-A1IA-06
TCGA-DA-A1IB-06
TCGA-DA-A1IC-06
TCGA-DA-A3F3-06
TCGA-DA-A3F5-06
TCGA-DA-A3F8-06
TCGA-DA-A95V-06
TCGA-DA-A95W-06
TCGA-DA-A95X-06
TCGA-DA-A95Y-06
TCGA-DA-A95Z-06
TCGA-DD-A116-01
TCGA-EB-A44Q-06
TCGA-EB-A44R-06
TCGA-EB-A5KH-06
TCGA-EB-A5SG-06
TCGA-EB-A5SH-06
TCGA-EB-A5UL-06
TCGA-EB-A5UN-06
TCGA-EB-A6L9-06
TCGA-EE-A17X-06
TCGA-EE-A17Y-06
TCGA-EE-A17Z-06
TCGA-EE-A180-06
TCGA-EE-A181-06
TCGA-EE-A182-06
TCGA-EE-A183-06
TCGA-EE-A184-06
TCGA-EE-A185-06
TCGA-EE-A20B-06
TCGA-EE-A20C-06
TCGA-EE-A20F-06
TCGA-EE-A20H-06
TCGA-EE-A20I-06
TCGA-EE-A29A-06
TCGA-EE-A29B-06
TCGA-EE-A29C-06
TCGA-EE-A29D-06
TCGA-EE-A29E-06
TCGA-EE-A29G-06
TCGA-EE-A29H-06
TCGA-EE-A29L-06
TCGA-EE-A29M-06
TCGA-EE-A29N-06
TCGA-EE-A29P-06
TCGA-EE-A29Q-06
TCGA-EE-A29R-06
TCGA-EE-A29S-06
TCGA-EE-A29T-06
TCGA-EE-A29V-06
TCGA-EE-A29W-06
TCGA-EE-A29X-06
TCGA-EE-A2A0-06
TCGA-EE-A2A1-06
TCGA-EE-A2A2-06
TCGA-EE-A2A5-06
TCGA-EE-A2A6-06
TCGA-EE-A2GB-06
TCGA-EE-A2GC-06
TCGA-EE-A2GD-06
TCGA-EE-A2GE-06
TCGA-EE-A2GH-06
TCGA-EE-A2GI-06
TCGA-EE-A2GJ-06
TCGA-EE-A2GK-06
TCGA-EE-A2GL-06
TCGA-EE-A2GM-06
TCGA-EE-A2GN-06
TCGA-EE-A2GO-06
TCGA-EE-A2GP-06
TCGA-EE-A2GR-06
TCGA-EE-A2GS-06
TCGA-EE-A2GT-06
TCGA-EE-A2GU-06
TCGA-EE-A2M5-06
TCGA-EE-A2M6-06
TCGA-EE-A2M7-06
TCGA-EE-A2M8-06
TCGA-EE-A2MC-06
TCGA-EE-A2MD-06
TCGA-EE-A2ME-06
TCGA-EE-A2MF-06
TCGA-EE-A2MG-06
TCGA-EE-A2MH-06
TCGA-EE-A2MI-06
TCGA-EE-A2MJ-06
TCGA-EE-A2MK-06
TCGA-EE-A2ML-06
TCGA-EE-A2MM-06
TCGA-EE-A2MN-06
TCGA-EE-A2MP-06
TCGA-EE-A2MQ-06
TCGA-EE-A2MR-06
TCGA-EE-A2MS-06
TCGA-EE-A2MT-06
TCGA-EE-A2MU-06
TCGA-EE-A3AA-06
TCGA-EE-A3AB-06
TCGA-EE-A3AC-06
TCGA-EE-A3AD-06
TCGA-EE-A3AE-06
TCGA-EE-A3AF-06
TCGA-EE-A3AG-06
TCGA-EE-A3AH-06
TCGA-EE-A3J3-06
TCGA-EE-A3J4-06
TCGA-EE-A3J5-06
TCGA-EE-A3J7-06
TCGA-EE-A3J8-06
TCGA-EE-A3JA-06
TCGA-EE-A3JB-06
TCGA-EE-A3JD-06
TCGA-EE-A3JE-06
TCGA-EE-A3JH-06
TCGA-EE-A3JI-06
TCGA-ER-A193-06
TCGA-ER-A195-06
TCGA-ER-A197-06
TCGA-ER-A198-06
TCGA-ER-A199-06
TCGA-ER-A19A-06
TCGA-ER-A19B-06
TCGA-ER-A19C-06
TCGA-ER-A19D-06
TCGA-ER-A19E-06
TCGA-ER-A19F-06
TCGA-ER-A19G-06
TCGA-ER-A19H-06
TCGA-ER-A19J-06
TCGA-ER-A19L-06
TCGA-ER-A19M-06
TCGA-ER-A19N-06
TCGA-ER-A19O-06
TCGA-ER-A19P-06
TCGA-ER-A19Q-06
TCGA-ER-A19S-06
TCGA-ER-A19W-06
TCGA-ER-A1A1-06
TCGA-ER-A2NC-06
TCGA-ER-A2ND-06
TCGA-ER-A2NE-06
TCGA-ER-A2NG-06
TCGA-ER-A2NH-06
TCGA-ER-A3ES-06
TCGA-ER-A3ET-06
TCGA-ER-A3EV-06
TCGA-ER-A3PL-06
TCGA-ER-A42K-06
TCGA-ER-A42L-06
TCGA-F5-6810-01
TCGA-FR-A3YN-06
TCGA-FR-A3YO-06
TCGA-FR-A44A-06
TCGA-FR-A69P-06
TCGA-FR-A729-06
TCGA-FR-A7U8-06
TCGA-FR-A7U9-06
TCGA-FR-A7UA-06
TCGA-FR-A8YC-06
TCGA-FR-A8YD-06
TCGA-FR-A8YE-06
TCGA-FS-A1YW-06
TCGA-FS-A1YX-06
TCGA-FS-A1YY-06
TCGA-FS-A1Z0-06
TCGA-FS-A1Z3-06
TCGA-FS-A1Z4-06
TCGA-FS-A1Z7-06
TCGA-FS-A1ZA-06
TCGA-FS-A1ZB-06
TCGA-FS-A1ZC-06
TCGA-FS-A1ZD-06
TCGA-FS-A1ZE-06
TCGA-FS-A1ZF-06
TCGA-FS-A1ZG-06
TCGA-FS-A1ZH-06
TCGA-FS-A1ZJ-06
TCGA-FS-A1ZK-06
TCGA-FS-A1ZM-06
TCGA-FS-A1ZP-06
TCGA-FS-A1ZQ-06
TCGA-FS-A1ZR-06
TCGA-FS-A1ZS-06
TCGA-FS-A1ZT-06
TCGA-FS-A1ZU-06
TCGA-FS-A1ZW-06
TCGA-FS-A1ZY-06
TCGA-FS-A1ZZ-06
TCGA-FS-A4F0-06
TCGA-FS-A4F4-06
TCGA-FS-A4F5-06
TCGA-FS-A4F8-06
TCGA-FS-A4F9-06
TCGA-FS-A4FB-06
TCGA-FS-A4FC-06
TCGA-FS-A4FD-06
TCGA-FW-A3I3-06
TCGA-FW-A3R5-06
TCGA-FW-A3TU-06
TCGA-FW-A3TV-06
TCGA-FW-A5DY-06
TCGA-GF-A3OT-06
TCGA-GF-A4EO-06
TCGA-GF-A6C8-06
TCGA-GF-A6C9-06
TCGA-GN-A262-06
TCGA-GN-A264-06
TCGA-GN-A265-06
TCGA-GN-A266-06
TCGA-GN-A267-06
TCGA-GN-A268-06
TCGA-GN-A26A-06
TCGA-GN-A26D-06
TCGA-GN-A4U3-06
TCGA-GN-A4U4-06
TCGA-GN-A4U7-06
TCGA-GN-A4U8-06
TCGA-GN-A4U9-06
TCGA-GN-A8LK-06
TCGA-GN-A8LL-06
TCGA-GN-A9SD-06
TCGA-HR-A2OG-06
TCGA-HR-A2OH-06
TCGA-LH-A9QB-06
TCGA-OD-A75X-06
TCGA-QB-A6FS-06
TCGA-QB-AA9O-06
TCGA-R8-A6YH-01
TCGA-RP-A690-06
TCGA-RP-A693-06
TCGA-RP-A694-06
TCGA-RP-A695-06
TCGA-RP-A6K9-06
TCGA-W3-A824-06
TCGA-W3-A825-06
TCGA-W3-A828-06
TCGA-W3-AA1O-06
TCGA-W3-AA1Q-06
TCGA-W3-AA1R-06
TCGA-W3-AA1V-06
TCGA-W3-AA1W-06
TCGA-W3-AA21-06
TCGA-WE-A8JZ-06
TCGA-WE-A8K1-06
TCGA-WE-A8K5-06
TCGA-WE-A8K6-06
TCGA-WE-A8ZM-06
TCGA-WE-A8ZN-06
TCGA-WE-A8ZO-06
TCGA-WE-A8ZQ-06
TCGA-WE-A8ZR-06
TCGA-WE-A8ZT-06
TCGA-WE-A8ZX-06
TCGA-WE-A8ZY-06
TCGA-WE-AA9Y-06
TCGA-WE-AAA0-06
TCGA-WE-AAA3-06
TCGA-WE-AAA4-06
TCGA-YD-A89C-06
TCGA-YD-A9TA-06
TCGA-YD-A9TB-06
TCGA-YG-AA3O-06
TCGA-YG-AA3P-06
TCGA-Z2-A8RT-06
TCGA-Z2-AA3S-06
TCGA-Z2-AA3V-06

probably don't want to filter these samples either (we can infer its acronym.)

Hmm this seems like an upstream issue and I'd prefer an upstream fix rather than hacking it ourselves. I propose we merge this PR and deal with this issue subsequently.

@dhimmel dhimmel merged commit 93e4c53 into cognoma:master Apr 16, 2018
@dhimmel dhimmel deleted the zero-mutation-samples branch April 16, 2018 14:09
@gwaybio
Member

gwaybio commented Apr 16, 2018

Ah, this is interesting, and now that I think about it, totally expected.

If I am remembering correctly (haven't confirmed), the clinical data stores mostly 01 sample types. 01 refers to Primary Solid Tumor (dictionary here). So all of the 06 (Metastatic) tumors will be dropped, even though the clinical data should match on the patient ID rather than the sample ID.
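
A quick way to check the breakdown by sample type is to count the two-digit suffix of each ID. The sketch below uses three IDs copied from the list above; in practice `sample_ids` would hold the full `samples_missing_clinical` list.

```python
import pandas

# Three IDs taken from the missing-sample list above, for illustration only.
sample_ids = pandas.Series([
    "TCGA-28-2510-01",
    "TCGA-3N-A9WB-06",
    "TCGA-D3-A1Q1-06",
])

# In these 15-character IDs the last two digits are the sample-type code
# (01 = Primary Solid Tumor, 06 = Metastatic).
sample_type_codes = sample_ids.str[-2:]
code_counts = sample_type_codes.value_counts()
```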

I am not sure where the upstream fix of this should live. Perhaps we should investigate sample specific vs. patient specific clinical data and merge mutation/gene exp calls on patient ID after the first merge on sample_id while retaining only patient specific identifiers (age, acronym, etc.) for these samples.
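
A rough sketch of the patient-level merge idea, under the assumption that a patient-level clinical table exists. The sample IDs, the `patient_df` table, and its columns are made up for illustration and do not correspond to real records.

```python
import pandas

# Made-up sample IDs in the 15-character TCGA format; not real samples.
samples_df = pandas.DataFrame({
    "sample_id": ["TCGA-XX-0001-01", "TCGA-XX-0002-06", "TCGA-XX-0003-06"],
})

# Hypothetical patient-level clinical table; column names are assumptions.
patient_df = pandas.DataFrame({
    "patient_id": ["TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"],
    "acronym": ["BRCA", "SKCM", "SKCM"],
    "age_diagnosed": [61, 47, 58],
})

# The first 12 characters of a sample ID identify the patient.
samples_df["patient_id"] = samples_df["sample_id"].str[:12]

# A left merge keeps every sample, metastatic (06) ones included, and
# attaches only patient-specific fields such as acronym and age.
merged_df = samples_df.merge(
    patient_df, on="patient_id", how="left", validate="many_to_one"
)
```

Merging on the patient ID means a metastatic sample still picks up its acronym even when the clinical table has no row for that exact sample ID.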

@dhimmel
Member Author

dhimmel commented Apr 16, 2018

I think it's simpler to not include metastatic tumors as there are not that many and they may break the assumption of independence between observations that many classifiers make (not sure if that really matters).

@gwaybio
Member

gwaybio commented Apr 16, 2018

I think it's simpler to not include metastatic tumors as there are not that many

Nearly all of the melanoma tumors (SKCM) are metastatic - these will be dropped if we go this route.

@gwaybio gwaybio mentioned this pull request Apr 16, 2018
@dhimmel
Member Author

dhimmel commented Apr 16, 2018

Turns out there are 389 missing samples:

FYI, I think this is not true and instead results from us filtering by sample types earlier in the notebook:

# Keep only these sample types
# filters duplicate samples per patient
sample_types = {
'Primary Solid Tumor',
'Primary Blood Derived Cancer - Peripheral Blood',
}
clinmat_df.query("sample_type in @sample_types", inplace=True)

Hence I gave my comment above a 👎.
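
To make the effect concrete, here is a small sketch with made-up data showing how the order of operations changes the "missing" count. The frame and index below are hypothetical stand-ins for `clinmat_df` and the mutation/expression sample index.

```python
import pandas

# Made-up clinical matrix indexed by sample ID; values are illustrative.
clinmat_df = pandas.DataFrame(
    {"sample_type": ["Primary Solid Tumor", "Metastatic"]},
    index=["TCGA-XX-0001-01", "TCGA-XX-0002-06"],
)
# Made-up index standing in for samples with mutation and expression data.
data_index = pandas.Index(["TCGA-XX-0001-01", "TCGA-XX-0002-06"])

sample_types = {
    "Primary Solid Tumor",
    "Primary Blood Derived Cancer - Peripheral Blood",
}

# Before the sample-type filter, every sample has clinical data.
missing_before = sorted(data_index.difference(clinmat_df.index))

# After the filter, the metastatic sample only appears to be missing.
filtered_df = clinmat_df.query("sample_type in @sample_types")
missing_after = sorted(data_index.difference(filtered_df.index))
```

In this toy case the metastatic sample is reported as missing only because the check runs after the filter, which is the same pattern as in the notebook.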
